
feat(quantized): GGUF-compat Q4_0 quant/dequant for burn QuantValue::Q4F/Q4S (sprint A5) #120

Merged
AdaWorldAPI merged 1 commit into master from claude/burn-A5-q4-quant on Apr 30, 2026

Conversation

@AdaWorldAPI
Owner

Summary

Sprint A5 of burn-ndarray parity sprint v1. Closes item (11) of the parity list — Q4 quant helpers needed for burn's QuantValue::Q4F / Q4S.

Existing quantize_f32_to_i4 audit

Pre-existing impl at src/hpc/quantized.rs:355:

  • pub fn quantize_f32_to_i4(data: &[f32]) -> (Vec<u8>, QuantParams) — already public
  • Per-tensor symmetric: single scale = abs_max / 7.0, zero_point = 0
  • Packing: low nibble first (element 0 → low nibble of byte 0; element 1 → high nibble of byte 0; consecutive layout)
  • Range clamped to [-8, 7], sign-extended on dequant
  • Does NOT match GGUF Q4_0 (different block size, scale formula, and packing layout)
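For reference, the audited per-tensor scheme can be sketched as below. The function names and signatures here are illustrative assumptions, not the actual code in src/hpc/quantized.rs; the sketch only mirrors the behavior the bullets describe (single scale = abs_max / 7, low-nibble-first consecutive packing, clamp to [-8, 7], sign-extend on dequant):

```rust
/// Illustrative sketch of the pre-existing per-tensor symmetric i4 scheme
/// (hypothetical names; not the repo's quantize_f32_to_i4 implementation).
fn quantize_i4_sketch(data: &[f32]) -> (Vec<u8>, f32) {
    let abs_max = data.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if abs_max == 0.0 { 1.0 } else { abs_max / 7.0 };
    let mut packed = vec![0u8; data.len().div_ceil(2)];
    for (i, &x) in data.iter().enumerate() {
        // Clamp to [-8, 7] and keep the low 4 bits of the two's-complement value.
        let q = (x / scale).round().clamp(-8.0, 7.0) as i8;
        let nibble = (q as u8) & 0x0F;
        if i % 2 == 0 {
            packed[i / 2] |= nibble; // element 0 -> low nibble of byte 0
        } else {
            packed[i / 2] |= nibble << 4; // element 1 -> high nibble of byte 0
        }
    }
    (packed, scale)
}

fn dequantize_i4_sketch(packed: &[u8], n: usize, scale: f32) -> Vec<f32> {
    (0..n)
        .map(|i| {
            let byte = packed[i / 2];
            let nibble = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            // Sign-extend the 4-bit value back to i8 before scaling.
            let q = ((nibble as i8) << 4) >> 4;
            q as f32 * scale
        })
        .collect()
}
```

Note the consecutive (non-interleaved) layout: this is exactly what makes the scheme byte-incompatible with GGUF Q4_0, as stated above.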

Decision

Option (a) additive — kept existing quantize_f32_to_i4 untouched (no breaking change to existing callers). Added new GGUF-compat functions alongside.

What's new (+211 LOC)

src/hpc/quantized.rs:466-676:

pub const Q4_0_BLOCK_SIZE: usize = 32;
pub const Q4_0_BYTES_PER_BLOCK: usize = 16;

/// Q4_0 packing — GGUF / llama.cpp compatible.
/// Per 32-element block: scale `d = max_signed / -8`, packed as 16 bytes
/// where byte `j` holds element `j` (low nibble) and `j+16` (high nibble).
pub fn quantize_f32_to_q4_0(data: &[f32]) -> (Vec<u8>, Vec<f32>);

/// Inverse — asserts on (packed.len(), scales.len()) consistency.
pub fn dequantize_q4_0_to_f32(packed: &[u8], scales: &[f32]) -> Vec<f32>;

The packing layout is the exact GGUF Q4_0 interleave (not the linear layout quantize_f32_to_i4 uses), matching what llama.cpp produces.
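As a point of comparison, here is a minimal sketch of the Q4_0 scheme described above. The constant names and function signatures follow the PR, but the bodies are an illustrative reconstruction rather than the merged code, and they use llama.cpp's truncating `x*id + 8.5` rounding:

```rust
pub const Q4_0_BLOCK_SIZE: usize = 32;
pub const Q4_0_BYTES_PER_BLOCK: usize = 16;

/// Sketch of GGUF Q4_0 quantization: per 32-element block, one f32 scale
/// d = max_signed / -8 plus 16 packed bytes with the interleaved layout
/// (element j -> low nibble of byte j, element j+16 -> high nibble).
pub fn quantize_f32_to_q4_0(data: &[f32]) -> (Vec<u8>, Vec<f32>) {
    assert!(data.len() % Q4_0_BLOCK_SIZE == 0, "input must be a multiple of 32");
    let n_blocks = data.len() / Q4_0_BLOCK_SIZE;
    let mut packed = vec![0u8; n_blocks * Q4_0_BYTES_PER_BLOCK];
    let mut scales = vec![0f32; n_blocks];
    for (b, block) in data.chunks_exact(Q4_0_BLOCK_SIZE).enumerate() {
        // max_signed: the element with the largest magnitude, sign preserved.
        let max_signed = block
            .iter()
            .fold(0.0f32, |m, &x| if x.abs() > m.abs() { x } else { m });
        let d = max_signed / -8.0;
        let id = if d != 0.0 { 1.0 / d } else { 0.0 };
        scales[b] = d;
        for j in 0..Q4_0_BYTES_PER_BLOCK {
            // llama.cpp rounding: truncate x*id + 8.5, clamp to [0, 15].
            // (Rust's `as u8` saturates negative floats to 0.)
            let lo = ((block[j] * id + 8.5) as u8).min(15);
            let hi = ((block[j + Q4_0_BYTES_PER_BLOCK] * id + 8.5) as u8).min(15);
            packed[b * Q4_0_BYTES_PER_BLOCK + j] = lo | (hi << 4);
        }
    }
    (packed, scales)
}

pub fn dequantize_q4_0_to_f32(packed: &[u8], scales: &[f32]) -> Vec<f32> {
    assert_eq!(packed.len(), scales.len() * Q4_0_BYTES_PER_BLOCK);
    let mut out = vec![0f32; scales.len() * Q4_0_BLOCK_SIZE];
    for (b, &d) in scales.iter().enumerate() {
        for j in 0..Q4_0_BYTES_PER_BLOCK {
            let byte = packed[b * Q4_0_BYTES_PER_BLOCK + j];
            // Stored nibbles are unsigned with a bias of 8.
            out[b * Q4_0_BLOCK_SIZE + j] = ((byte & 0x0F) as f32 - 8.0) * d;
            out[b * Q4_0_BLOCK_SIZE + j + Q4_0_BYTES_PER_BLOCK] =
                ((byte >> 4) as f32 - 8.0) * d;
        }
    }
    out
}
```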

Tests (6 new, 17/17 pass)

  • test_i4_boundary_values — exact boundaries at ±7 (scale=1.0) and clamp at ±8 (scale=8/7)
  • test_q4_0_roundtrip_single_block — 32 floats round-trip
  • test_q4_0_roundtrip_multi_block — 3-block (96 floats)
  • test_q4_0_zero_block — d=0 edge case
  • test_q4_0_packing_layout_interleaved — asserts byte j holds elements j and j+16
  • test_q4_0_requires_block_aligned — #[should_panic] for non-32-multiple input

Acceptance

  • cargo build: clean (existing 39 warnings, none new)
  • cargo test --lib quantized: 17 passed, 0 failed
  • cargo fmt --check: only pre-existing diffs in code A5 didn't touch; new code is fmt-clean

Plan reference

Notes

The commit is GPG-signed (A5 worked around the environment's codesign-helper quirk by mirror-committing in /home/user/ndarray and fetching the SHA into its worktree), using the same key as recent master commits.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj


Generated by Claude Code


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7609ccd67f


Comment thread: src/hpc/quantized.rs, lines +526 to +529
let lo = ((block[j] * id).round() + 8.5).floor().clamp(0.0, 15.0) as u8;
let hi = ((block[j + Q4_0_BYTES_PER_BLOCK] * id).round() + 8.5)
.floor()
.clamp(0.0, 15.0) as u8;


P1: Use GGUF Q4_0 quantizer rounding rule

quantize_f32_to_q4_0 currently computes each nibble with ((x * id).round() + 8.5).floor(), but GGUF/llama.cpp Q4_0 uses truncation of x * id + 8.5 (effectively floor(x * id + 8.5) for this nonnegative range). These are not equivalent for negative half-step inputs (e.g. x*id = -0.5 gives 7 here vs 8 in GGUF), so this can produce different packed bytes from the same weights and break the advertised byte-level compatibility with existing Q4_0 tensors.

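To make the flagged difference concrete, the two rounding expressions can be isolated as below. The helper names are hypothetical: nibble_pr mirrors the PR's expression from the quoted lines, nibble_gguf the llama.cpp truncation rule; v stands for the scaled value x*id:

```rust
/// The PR's expression: round to nearest (half away from zero), add the
/// bias, then floor. Hypothetical helper name for illustration.
fn nibble_pr(v: f32) -> u8 {
    ((v.round() + 8.5).floor().clamp(0.0, 15.0)) as u8
}

/// The GGUF/llama.cpp rule: truncate v + 8.5 (floor for this nonnegative
/// range) and clamp to 15. Hypothetical helper name for illustration.
fn nibble_gguf(v: f32) -> u8 {
    ((v + 8.5) as u8).min(15)
}
```

At v = -0.5, Rust's round-half-away-from-zero gives round(-0.5) = -1, so nibble_pr yields 7, while nibble_gguf computes 8.0 as u8 = 8 — exactly the divergence the review describes; away from half-steps (e.g. v = -1.0) both yield 7.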

Add quantize_f32_to_q4_0 / dequantize_q4_0_to_f32 implementing the
GGUF / llama.cpp per-32-element block scheme: 16 packed bytes plus
one f32 scale d = max_signed/-8 per block, with the canonical
interleaved nibble layout (element j -> low nibble of byte j;
element j+16 -> high nibble of byte j).

The existing per-tensor quantize_f32_to_i4 (low-nibble-first,
non-interleaved, scale = abs_max/7) is preserved unchanged for
backwards compatibility. Burn QuantValue::Q4F / Q4S callers can
opt into either scheme.

Tests: i4 boundary +/-7 and clamp +/-8; Q4_0 single-block,
multi-block, zero-block, interleaved layout, non-aligned panic.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
@AdaWorldAPI force-pushed the claude/burn-A5-q4-quant branch from 7609ccd to 376aacb on April 30, 2026 09:51
@AdaWorldAPI merged commit 035dc41 into master on Apr 30, 2026
5 of 10 checks passed



2 participants